209 research outputs found

    New Spiked-In Probe Sets for the Affymetrix HGU-133A Latin Square Experiment

    The Affymetrix HGU-133A spike-in data set has been used to determine the sensitivity and specificity of various methods for the analysis of microarray data. We show that 22 additional probe sets detect spike-in RNAs and should be treated as spike-in probe sets. We assign each proposed spiked-in probe set to a concentration group within the Latin Square design, and examine the effects of the additional spiked-in probe sets on assessing the accuracy of analysis methods currently in use. We show that several popular preprocessing methods are more sensitive and specific when the new spike-ins are used to determine false positive and false negative rates.

    Age-adjusted nonparametric detection of differential DNA methylation with case–control designs

    Background: DNA methylation profiles differ among disease types and, therefore, can be used in disease diagnosis. In addition, large-scale whole-genome DNA methylation data offer tremendous potential for understanding the role of DNA methylation in normal development and function. However, owing to the unique features of methylation data, few powerful and robust statistical methods are available in this area. Results: In this paper, we propose and examine a new statistical method to detect differentially methylated loci for case–control designs that is fully nonparametric and does not depend on any assumption about the underlying distribution of the data. Moreover, the proposed method adjusts for the age effect, which has been shown to be highly correlated with DNA methylation profiles. Using simulation studies and a real data application, we demonstrate the advantages of our method over commonly used existing methods. Conclusions: Compared to existing methods, our method improved the detection power for differentially methylated loci for case–control designs and controlled the type I error well. Its applications are not limited to methylation data; it can be extended to many other case–control studies.

    CMAX3: A Robust Statistical Test for Genetic Association Accounting for Covariates

    The additive genetic model as implemented in logistic regression has been widely used in genome-wide association studies (GWASs) for binary outcomes. Unfortunately, for many complex diseases, the underlying genetic models are generally unknown, and a mis-specification of the genetic model can result in a substantial loss of power. To address this issue, the MAX3 test (the maximum of three separate test statistics) has been proposed as a robust test that performs plausibly regardless of the underlying genetic model. However, the original implementation of MAX3 utilizes the trend test, so it cannot adjust for covariates such as age and gender. This drawback has significantly limited the application of MAX3 in GWASs, as covariates account for a considerable amount of variability in these disorders. In this paper, we extend MAX3 and propose CMAX3 (covariate-adjusted MAX3), which is based on logistic regression. The proposed test yielded robust efficiency similar to that of the original MAX3 while easily adjusting for any covariate within the likelihood framework. An asymptotic formula for calculating the p-value of the proposed test is also developed in this paper. The simulation results showed that the proposed test performed desirably under both the null and alternative hypotheses. For the purpose of illustration, we applied the proposed test to re-analyze a case-control GWAS dataset from the Collaborative Studies on Genetics of Alcoholism (COGA). The R code to implement the proposed test is also introduced in this paper and is available for free download.
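    For orientation, the original (covariate-free) MAX3 statistic that this abstract builds on can be sketched as the largest absolute Cochran-Armitage trend statistic over recessive, additive, and dominant genotype scores. This is only an illustrative sketch, not the authors' R code and not the covariate-adjusted CMAX3; the function names and the example counts are hypothetical.

```python
import math

# Genotype scores for the three genetic models (genotypes ordered by
# number of risk alleles: 0, 1, 2).
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}

def trend_z(cases, controls, scores):
    """Cochran-Armitage trend statistic for a 2 x 3 genotype table."""
    R, S = sum(cases), sum(controls)
    N = R + S
    n = [r + s for r, s in zip(cases, controls)]
    num = sum(x * (S * r - R * s) for x, r, s in zip(scores, cases, controls))
    var_part = (N * sum(x * x * ni for x, ni in zip(scores, n))
                - sum(x * ni for x, ni in zip(scores, n)) ** 2)
    return math.sqrt(N) * num / math.sqrt(R * S * var_part)

def max3(cases, controls):
    """Maximum absolute trend statistic over the three genetic models."""
    return max(abs(trend_z(cases, controls, s)) for s in SCORES.values())

# Hypothetical genotype counts, for illustration only.
cases = (10, 20, 30)
controls = (30, 20, 10)
print(max3(cases, controls))
```

    Note that the maximum of three correlated statistics does not follow a standard normal distribution, so its p-value needs a dedicated approximation or simulation; the sketch stops at the statistic itself.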

    A Method to Detect AAC Audio Forgery

    Advanced Audio Coding (AAC), a standardized lossy compression scheme for digital audio that was designed to be the successor of the MP3 format, generally achieves better sound quality than MP3 at similar bit rates. Although AAC is the default or standard audio format for many devices and AAC audio files may be presented as important digital evidence, methods for authenticating such files are highly needed but largely missing. In this paper, we propose a scheme to expose tampered AAC audio streams that are encoded at the same encoding bit rate. Specifically, we design a shift-recompression-based method to retrieve the differential features between the re-encoded audio stream at each shift and the original audio stream. A learning classifier is then employed to recognize the different patterns of differential features in doctored files and original (untouched) audio files. Experimental results show that our approach is promising and effective in detecting same-bit-rate forgery of AAC audio streams. Our study also shows that shift-recompression-based differential analysis is very effective for detecting MP3 forgery at the same bit rate.

    Detecting Differentially Methylated Loci for Multiple Treatments Based on High-Throughput Methylation Data

    This article was originally published by BMC Bioinformatics. Background: Because of its important effects, as an epigenetic factor, on gene expression and disease development, DNA methylation has drawn much attention from researchers. Detecting differentially methylated loci is an important but challenging step in studying the regulatory roles of DNA methylation in a broad range of biological processes and diseases. Several statistical approaches have been proposed to detect significantly methylated loci; however, most of them were designed specifically for case-control studies. Results: Noting that age is associated with methylation level and that methylation data are not normally distributed, we propose a nonparametric method to detect differentially methylated loci under multiple conditions with trend for Illumina Array Methylation data. The nonparametric Cuzick test is used to detect differences among treatment groups with trend within each age group; an overall p-value is then calculated by combining the independent p-values, one from each age group. Conclusions: We compare the new approach with other methods using simulated and real data. Our study shows that the proposed method outperforms the other methods considered in this paper in terms of power: it detected more biologically meaningful differentially methylated loci than the others. The first author also acknowledges the support from the faculty research funds awarded by the School of Public Health, Indiana University Bloomington.
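    The step of combining independent per-age-group p-values into one overall p-value can be illustrated with Fisher's method. The abstract does not say which combination rule the authors use, so Fisher's method is an assumption here, and the function name and the example p-values are hypothetical.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has the closed form
        P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)**j / j!
    so no external statistics library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is x / 2
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return math.exp(-half_x) * total

# Hypothetical p-values, one per age group (assumed to come from a
# per-group trend test such as the Cuzick test).
per_age_group = [0.04, 0.20, 0.01]
print(fisher_combine(per_age_group))
```

    The closed form keeps the sketch dependency-free; with a statistics library available, the same value would come from the chi-square survival function with 2k degrees of freedom.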

    A distribution-free convolution model for background correction of oligonucleotide microarray data

    Introduction: Affymetrix GeneChip® high-density oligonucleotide arrays are widely used in biological and medical research because of production reproducibility, which facilitates the comparison of results between experiment runs. In order to obtain high-level classification and cluster analysis that can be trusted, it is important to perform various pre-processing steps on the probe-level data to control for variability in sample processing and array hybridization. Many proposed preprocessing methods are parametric, in that they assume that the background noise generated by microarray data is a random sample from a statistical distribution, typically a normal distribution. The quality of the final results depends on the validity of such assumptions. Results: We propose a Distribution Free Convolution Model (DFCM) to circumvent observed deficiencies in meeting and validating distribution assumptions of parametric methods. Knowledge of array structure and the biological function of the probes indicates that the intensities of mismatched (MM) probes that correspond to the smallest perfect match (PM) intensities can be used to estimate the background noise. Specifically, we obtain the smallest q2 percent of the MM intensities that are associated with the lowest q1 percent PM intensities, and use these intensities to estimate background. Conclusion: Using the Affymetrix Latin Square spike-in experiments, we show that the background noise generated by microarray experiments typically is not well modeled by a single overall normal distribution. We further show that the signal is not exponentially distributed, as is also commonly assumed. Therefore, DFCM has better sensitivity and specificity, as measured by ROC curves and area under the curve (AUC), than MAS 5.0, RMA, RMA with no background correction (RMA-noBG), GCRMA, PLIER, and dChip (MBEI) for preprocessing of Affymetrix microarray data. These results hold for two spike-in data sets and one real data set that were analyzed. Comparisons with other methods on two spike-in data sets and one real data set show that our nonparametric methods are a superior alternative for background correction of Affymetrix data.
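    The probe-selection rule described in the abstract (the smallest q2 percent of MM intensities among the probe pairs with the lowest q1 percent PM intensities) can be sketched roughly as follows. This is an illustration only: the function name, the q1/q2 values in the example, and the use of the median to summarize the selected intensities are all assumptions, not details taken from the paper.

```python
import math
from statistics import median

def dfcm_background_sketch(pm, mm, q1, q2):
    """Select MM intensities for background estimation.

    pm, mm : paired perfect-match / mismatch intensities (same length).
    q1     : percent of probe pairs kept, ranked by lowest PM intensity.
    q2     : percent of the MM values of those pairs kept, smallest first.
    Returns (summary, selected), where summary is the median of the
    selected MM values (the median is an assumption made for this sketch).
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    # Step 1: probe pairs with the lowest q1 percent of PM intensities.
    k1 = max(1, math.ceil(len(pm) * q1 / 100.0))
    low_pm_idx = sorted(range(len(pm)), key=lambda i: pm[i])[:k1]
    # Step 2: smallest q2 percent of the MM intensities of those pairs.
    mm_subset = sorted(mm[i] for i in low_pm_idx)
    k2 = max(1, math.ceil(len(mm_subset) * q2 / 100.0))
    selected = mm_subset[:k2]
    return median(selected), selected

# Hypothetical intensities, for illustration only.
pm = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
mm = [5, 8, 7, 100, 50, 60, 70, 80, 90, 95]
summary, selected = dfcm_background_sketch(pm, mm, q1=40, q2=75)
print(summary, selected)
```

    Trimming the largest MM values in step 2 is what discards mismatch probes that picked up real signal, such as the value 100 in the example.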

    A gene selection method for GeneChip array data with small sample sizes

    Background: In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and decide cutoff p-values for gene selection appropriately. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available in very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods. Results: We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values, which are uniformly distributed from true nulls, are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate. Conclusion: Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.

    Gene selection and classification for cancer microarray data based on machine learning and similarity measures

    Background: Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes, which provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money. Results: To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. By using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection and several others. Conclusions: On average, with the use of popular learning machines including Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier and Random Forest, Recursive Feature Addition outperformed other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.

    Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data

    Microarray data has a high dimension of variables but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods